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1 I Abstract. Gibbs random fields (GRF) are polymorphous statistical models that 
^^ can be used to analyse different types of dependence, in particular for spatially 
r ^ correlated data. However, when those models are faced with the challenge of 

• selecting a dependence structure from many, the use of standard model choice 

rrt methods is hampered by the unavailability of the normalising constant in the 

"*T* Gibbs likelihood. In particular, from a Bayesian perspective, the computation 

I I of the posterior probabilities of the models under competition requires special 

likelihood-free simulation techniques like the Approximate Bayesian Computation 

CO (ABC) algorithm that is intensively used in population genetics. We show in 

^ this paper how to implement an ABC algorithm geared towards model choice in 

the general setting of Gibbs random fields, demonstrating in particular that there 

exists a sufficient statistic across models. The accuracy of the approximation to 

the posterior probabilities can be further improved by importance sampling on 

the distribution of the models. The practical aspects of the method are detailed 

t through two applications, the test of an iid Bernoulli model versus a first-order 

^-^ Markov chain, and the choice of a folding structure for two proteins. 

OO 

T\ Keywords: Approximate Bayesian Computation, model choice, Gibbs Random Fields, 
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cd 1 Introduction 

1.1 Gibbs random fields 



We consider a finite set of sites 5 = {1, • • • ,n}. At each site i G S, we observe Xj € Xi 
where Xi is a finite set of states. X — HiLi '^* ^^ ^^^ ^^^ "-"^ ^^'^ configurations, x = 
(xi, • • • ,Xn) corresponding to one configuration. We also consider an undirected graph 
G = {E{G),V{Q)) on S, V{G) being a vertex set and E{Q) an edge set. The sites i and i 
are said neighbours (denoted z ~ i ) if (z, i ) S E{Q), in other words, if there is a vertex 
between i and i . A clique c is a subset of S where all elements are mutual neighbours 
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(Darroch et al. 1980). We denote by C the set of all cliques of the undirected graph Q. 



In the finite framework previously adopted, Gibbs Random Fields (GRFs) are prob- 
abilistic models associated with densities (with respect to the counting measure) 



/(x) = 1 exp{-t/(x)} = I cxp I - ^ C/,(x) I 

I cec ) 



(1) 



where C/(x) = '^^f^(^ Uc(x) is the potential and Z is the corresponding normalising 
constant 



Z=^exp|-^C/,(x)l 
xex [ cec ) 



If the density / of a Markov Random Field (MRF) is everywhere positive, then the 
Hammersley-ClifFord theorem establishes that there exists a GRF representation of this 



MRF (Besag) 19741 



We consider here GRF with potential [/(x) = 5(x) where 9 £ W is a, scale 
parameter, S{-) is a function taking values in W. S'(x) is defined on the cliques of the 
neighbourhood system in that S'(x) — J^cec '5'c(x). In that case, we have 



/(x|e) = ^exp{0T5(x)}, 
the normalising constant Zg now depends on the scale parameter 6. 



(2) 



1.2 Bayesian model choice 

When considering model selection within this class of Gibbs models, the primary diffi- 
culty to address is the unavailability of the normalising constant Zg. In most realistic 
settings, the summation 

Zg=Y, exp{e^S{^)} 
xex 

involves too many terms to be manageable. Numerical approximations bypassing this 
constant like path sampling (Gelman and Meng 19981, pseudo likelihood (Besag 19751 



or those based on an auxiliary variable (M0ller et al. 2006J are not always available 
either because they require heavy computations or because they are not accurate enough 
in the case of the pseudo-likelihood. In particular, selecting a model with sufficient 
statistic ^o taking values in M.p° versus a model with sufficient statistics ^i taking 
values in M^^ relies on the Bayes factor corresponding to the priors ttq and tti on the 
respective parameter spaces 

BF^„/„,(x) = y"exp{0jS'o(x)}/Ze„,o7ro(d0o) / 



exp{0^5i(x)}/Ze,,i7ri(d0i) 
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but this quantity is not easily computable. One faces the same computational difficulties 
with the posterior probabilities of the models since they also depend on those unknown 
constants. To properly approximate those posterior quantities, we thus propose an 
alternative resolution based on likelihood-free techniques such as Approximate Bayesian 



Computation (ABC) (Pritchard et al. 19991 and we show how ABC is naturally tuned 
for this purpose by providing a direct estimator of the Bayes factor. 

^From a modelling perspective, GRF are used to model the dependency within 



spatially correlated data, with applications in epidemiology (Green and Richardson] 
[2002) and image analysis (Ibanez and Simo 2003), among others (Rue and Held, 2005] 



They often use a Potts model defined by a sufficient statistic S taking values in 
that 






^(x)-E: 



■{Xi—X^/ } 5 



where X]i'~i indicates that the summation is taken over all the neighbour pairs. In that 
case, X — {I,- ■ ■ , Ky\ K — 2 corresponding to the Ising model, and 6* is a scalar. S{-) 
therefore monitors the number of identical neighbours over X. 

1.3 Plan 

For a fixed neighbourhood or model, the unavailability of Zg complicates inference on the 
scale parameter 9, but the difficulty is increased manifold when several neighbourhood 
structures are under comparison. In section [21 we describe the main likelihood-free 
algorithms before proposing a procedure based on an ABC algorithm aimed at selecting 
a model. Then, we show how to improve the accuracy of this approximation using an 
importance sampling procedure. In section [3| we consider the toy example of an iid 
sequence [with trivial neighbourhood structure] tested against a Markov chain model 
[with nearest neighbour structure] as well as a biophysical example aimed at selecting 
a protein 3D structure. 



2 Methods 

2.1 Approximate Bayesian Computation 

When the likelihood is not available in closed form, there exist likelihood- free meth- 
ods that overcome the difficulty faced by standard simulation techniques via a basic 
acceptance-rejection algorithm. The algorithm on which the ABC method [introduced 
by Pritchard et al. (19991 and expanded in Beaumont et al. (20021 and Marjoram et al. 



(2003l] is based can be briefly described as follows: given a dataset x''' = (a;i, • • • ,Xn) 
associated with the sampling distribution f{-\9), and under a prior distribution 7r(6') on 
the parameter 9, this method generates a parameter value from the posterior distribu- 
tion 7r(0|x°) C!C Tr{9)f{x'^\9) by simulating jointly a value 9* from the prior, 9* ~ 7r(-), 
and a value x* from the sampling distribution x* ~ f{-\9*) until x* is equal to the 
observed dataset x°. The rejection algorithm thus reads as 
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Exact rejection algorithm: 

1. Generate 6* from the prior tt. 

2. Generate x* from the model f{-\0*). 

3. Accept 9* if X* = x", otherwise, start again in 1. 



This solution is not approximative in that the output is truly simulated from the pos- 
terior distribution 7r(6l|xO) ex /(xO|6l)7r(6') since (6'*,x*) - 7r(6'*)I{x*=xO}/(x*|6l). In 
many settings, including those with continuous observations x", it is however imprac- 
tical or impossible to wait for x* = x'' to occur and the approximative solution is to 
introduce a tolerance in the test, namely to accept 9* if simulated data and observed 
data are close enough, in the sense of a distance p, given a fixed tolerance level e, 
/9(x*,x°) < e. The distance p is open to choice but is usually an Euchdean distance 



p(x* 



ELi(^ 



^0\2 



(see Beaumont et al. (20021 or Blum and Frangois (2008l) 



The corresponding e-tolerance rejection algorithm is then 



e-tolerance rejection algorithm: 

1. Generate 9* from the prior tt. 

2. Generate x* from the model f{-\9*). 

3. Accept 9* if /j(x*,x'') < e, otherwise, start again in 1. 



This approach is obviously approximative when e 7^ 0. The output from the e-tolerance 
rejection algorithm is thus associated with the distribution 

^(0|p(x*,x°) < e) ex ^(^)Pe(p(x*,x°) < e) 



The choice of e is therefore 
e is too large, the approxima- 



with Pe(p(x*,xO) < e) = /I{p(,.,,o)<,}/(x*|r)dx*. 
paramount for good performances of the method. If 
tion is poor; when e ^ 00, it amounts to simulating from the prior since all simula- 
tions are accepted (as Pe(p(x*,x°) < e) ^ 1 when e -^ 00). If e is sufficiently small, 
7r(0|/9(x*, x") < e) is a good approximation of 7r(0|x''). There is no approximation when 
e = 0, since the e-tolerance rejection algorithm corresponds to the exact rejection al- 
gorithm, but the acceptance probability may be too low to be practical. Selecting the 
"right" e is thus crucial. It is customary to pick e as an empirical quantile of p(x*,x°) 
when X* is simulated from the marginal distribution x* ex J TT{9)Fg{p{x* ,x^) < e)d9, 
and the choice is often the corresponding 1% quantile (see, for instance Beaumont et al. 
(20021 or Blum and Frangois ( 2008[ )). Wilkinson ( 2008[ ) propose to replace the approx- 
imation by an exact simulation based on a convolution with an arbitrary kernel. 

The data x'' usually being of a large dimension, another level of approximation is 
enforced within the true ABC algorithm, by replacing the distance p(x*,x°) with a 
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corresponding distance between summary statistics p{S{x*),S{x'^)) ( Beaumont et al 
[2002). When 5* is a sufficient statistic, this step has no impact on the approximation 
since tt{9\p{S{x*),S{x^)) = 7r(6'|p(x*,x°). In practice, it is rarely the case that a suf- 



ficient statistic of low dimension is available when implementing ABC (see Beaumont 
et al. ((2002) or Blum and Frangois (12008*)). As it occurs, the setting of model choice 



among Gibbs random fields is an exception in that it allows for such a beneficial struc- 
ture, as will be shown below. In the general case, the output of the ABC algorithm is 
therefore a simulation from the distribution 7t{9\p{S{x*), S{x'^)) < e). The algorithm 
reads as follows: 

ABC algorithm: 

1. Generate 9* from the prior tt. 

2. Generate x* from the model f{-\9*). 

3. Compute the distance p(S'(x°), 5(x*)). 

4. Accept 9* if p(S'(x"), S'(x*)) < e, otherwise, start again in 1. 



2.2 Model choice via ABC 

In a model choice perspective, we face M Gibbs random fields in competition, each 
model m being associated with sufficient statistic Sm (0 < m < Af — 1), i.e. with 
corresponding likelihood 

/m(x|6'„i) = exp {9'^^S„i{x)} /Zg^^rri , 

where 9m G ©m and Z^^^ra is the unknown normalising constant. Typically, the choice 
is between M neighbourhood relations z ~ i' (0 < m < M — 1) with S'm(x) = 
^.m, Ij-j,.^^.,,}. ^From a Bayesian perspective, the choice between those models is 
driven by the posterior probabilities of the models. Namely, if we consider an extended 
parameter space 6 = U^Zq {m} x 6m that includes the model index M. , we can define 
a prior distribution on the model index 7r(A^ = m) as well as a prior distribution on 
the parameter conditional on the value to of the model index, Trm{9m), defined on the 
parameter space Qm- The computational target is thus the model posterior probability 



F{M = m|x) ex / fm{x\9m)Trm{dm) d9m 7r(7V/( = to) , 

i.e. the marginal of the posterior distribution on {A4, 9q, . . . , ^j\/-i) given x. Therefore, 
if 5'(x) is a sufficient statistic for the joint parameters {M.,9q, . . . , 9m-i), 

f{M = m|x) = V{M = m\S{x)) . 

Each model has its own sufficient statistic S'm(-). Then, for each model, the vector 
of statistics S{-) = {Sq{-),. . . ,Sm-i{')) is obviously sufficient (since it includes the 
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sufficient statistic of each model). Moreover, the structure of the Gibbs random field 
allows for a specific factorisation of the distribution /m(x|6'm,). Indeed, the distribution 
of X in model m factorises as 

fm{'^\0m) = hrn{x\S{x))gm{S{x)\dm) 

where gm{S (x)\9m) is the distribution of S'(x) within model m [not to be confused with 
the distribution of S'm(x)] and where 

n{Siic)) ^l{xeX : S{i) = Six)} 

is the cardinality of the set of elements of X with the same sufficient statistic, which does 
not depend on m (the support of fm is constant with to). The statistic S'(x) is therefore 
also sufficient for the joint parameters {M,9f), . . . ,6'm-i)- That the concatenation of 
the sufficient statistics of each model is also a sufficient statistic for the joint parameters 
{Ai,6o, . . . , Om~i) is obviously a property that is specific to Gibbs random field models. 

Note that when we consider M models from generic exponential families, this prop- 
erty of the concatenated sufficient statistic rarely holds. For instance, if under model 
Al = 0, Xi\6o ^ V{Oa) and under model M. = \, Xi\6i ^ Qeo{9i), this property is not 
satisfied since the distribution of x given the common 5(x) = X]i=i ^« i^ ^^"^ f^''^^ model 

/io(x|5(x)) = 






n: 



X. 



is different from the distribution of x given 5(x) in the other one 

/ii(x|S'(x)) ^ 

n{S{:ic)) 

As a consequence, >5'(x) is not sufficient for the parameter M. For Gibbs random fields 
models, it is possible to apply the ABC algorithm in order to produce an approximation 
with tolerance factor e: 

ABC algorithm for model choice (ABC-MC): 

1. Generate to* from the prior 7r(A^ = to). 

2. Generate 6*^. from the prior tt„i*{-). 

3. Generate x* from the model /,„. (-j^^*). 

4. Compute the distance p{S{x'^), S{x*)). 

5. Accept {6^,,m*) if p(5(x"),S'(x*)) < e, otherwise, start again in 1. 
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Simulating a data set x* from /„. (-I^m* ) at step 3 is often non-trivial for GRFs. For 
the special case of the Ising model considered in the examples below, there have been 
many developments from Besag ( 1974 ) to M0ller and Waagepetersen ( 2003 1 that allow 
for exact simulation via perfect sampling. We refer the reader to 'HaggstromI (20021, 
M0ller (2003) and M0ller and Waagepetersen (20031, for details of this simulation tech- 
nique and for a discussion of its limitations. For other GRFs it is often possible to use 
a Gibbs sampler updating one clique at a time conditional on the others. This solution 
was implemented for the biophysical example of Section |3.2[ 



For the same reason as above, this algorithm results in an approximate generation 
from the joint posterior distribution 

7r{(M,0o,-..,eM-i)|p(5(x"),5(x*))<e} . 



When it is possible to achieve e = 0, the algorithm is exact since 5* is a sufficient 
statistic. We have thus derived a likelihood- free method to handle model choice. 

Once a sample of N values of {9"i,,m") (1 < J < N) is generated from this algo- 
rithm, a standard Monte Carlo approximation of the posterior probabilities is provided 
by the empirical frequencies of visits to the model, namely 

P(A^ = m|x°) = tt{m" = m}/N , 

where tt{m** = m\ denotes the number of simulated rn**'s equal to m. Correlatively, the 
Bayes factor associated with the evidence provided by the data x*^ in favour of model 
mo relative to model ?7ii is defined by 



BF, 



o/mi(^ ) 



V{M = TOo|x°) t:{M = mi) 



V{M = 7711 |xO) 7r(7W = mo) 
//mo(x°|go)^o(go)7r(X ^ mo)dgo ^{M ^ mQ 
//„,(xO|0i)^i(0i)7r(X = mi)d0i7r(X=mo) 



(3) 
(4) 



The previous estimates of the posterior probabilities can then be plugged-in to approx- 
imate the above Bayes factor by 



BF 



mo/i 



..(x°) 



V{M 



V{M = mi|xO) 

tJ{m* 



mo|x°) ^{M 



mi) 



tt{r 



mo} tt{M 

X 



tt{M = mo) 
mi) 



but this estimate is only defined when tl{r77, 
substitute 

i*-^mo/mi(x ) - - 



= mi} tt{M = mo) 
= 7771 } 7^ 0. To bypass this difficulty, the 
t:{M = mi) 



+ tt{777** ~ mi} 7r(Al = 7Tlo) 

is particularly interesting because we can evaluate its bias. (Note that there does not 
exist an unbiased estimator of BF^g/^^{x°) based on the m"'s.) Indeed, assuming 
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without loss of generality that 7r(A^ = rrii) = 7r(A^ = mo), if we set Nq = 'i{m" = toq}, 
-/Vi = tt{m" = nil} then conditionally on A*" = A'o + -/Vi, -/Vi is a binomial B{N,p) rv 
with probability p = (1 + _Bi^„jjj/„j^(x°))^^. It is then straightforward to establish that 



E 



A^n + l 



A^i + l 



N 



- BF / (ii°] + ^ ^ + ^ (l-n]^+^ 



The bias in the estimator i??,„(,/,„j(x") is thus {1 - (TV + 2)(1 - p)^+i}/(A^ + l)p, 
which goes to zero as N goes to infinity. 

BFjyig/jy^^ (x'^) can be seen as the ratio of the posterior means on the model probabil- 
ities p under a 'Dir{l, ■ ■ ■ ,1) prior. In fact, if we denote Nj = tt{m** = m,j}, A^ = Et^o 
then the vector (A^i, • ■ ■ , Nm) has a multinomial distribution 

(A^o,--- ,^A/-ibo,--- ,Pm-i) ^ M{N;po,--- ,pm-i) ■ 

The corresponding posterior distribution on p is a I?ir(l + Nq, ■ ■ ■ , 1 + Nm-i) and 

^ _ E\po\No,--- ,Nm-i] ^ iVo + 1 

"°/"^^^ IE[pi|iVo,---,7VM_i] iVi + 1 

is a consistent estimate of i?F,„g/mj(x*'). 

Since the distribution of the sample (6*^;. , m.**)(i<j<jv) is not exactly tt {(A4, Oq, ■ ■ ■ , 6'm-i)|x"} 
but TT {{A4,9q, . . . ,9m-i)\p{S{x'^),S{x*)) < e}, the Bayes factor should be written as 

^ PjM ^ mo\piSi^% g(x*)) < e) 7r(X = mQ 

^ /^{(M = mo,go)|p(g(x°),g(x*)) < 6}dgo ^(Al ^ mi) 

/7r{(M = mi,6li)|p(S'(xO),5(x*)) < e}d6li 7r(Al = mo) 
^ /Pe„(p(5(x"),5(x*)) < e)7ro(go)7r(A^ ^ mo)d0o ^(X ^ mQ 

/Pei(p(5(xO),5(x*)) < e)7ri(0i)7r(A1 = mi)d0i 7t{M = mo) 
_ / [//mo(x*|6>o)7ro(6lo)d6>o] I{p(s(xO),s(x'))<£}dx* 

/ [//mi(x*|6'i)7ri(6li)d6'i] I{p(s(xO),s(x-))<e}dx* 

When e = and S'(x) is a sufficient statistic, this expression corresponds to equation 

2.3 Two step ABC 

The above estimator i3i^„g/^j(x'^) is rather unstable (i.e. it suffers from a large vari- 
ance) when -Bi^m(,/„ij (x") is very large since, when P(A^ = nii |x") is very small, tt{m** = 
mi} is most often equal to zero. This difficulty can be bypassed by a reweighting 
scheme. If the choice of m* in the ABC algorithm is driven by the probability distribu- 
tion P{M = mi) = g = 1 -P(A^ = mo) rather than by it{M = mi) = 1 — ^{M — mo). 
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the value of tt{77i** = rrii} can be increased and later corrected by considering instead 

1 + tJ{m''* = mo} g 



BF 



' 1 + ft{m'* = mil l — g 

Therefore, if a first run of the ABC algorithm exhibits a very large value of BFma/mi (x") , 
the estimate i?_F„p /„j^(x'^) produced by a second run with 



Q(x I V{M = mi|x") 



will be more stable than the original BF^^i^^ (x*^)- I^i the most extreme cases when no 
m** is ever equal to mi, this corrective second is unlikely to bring much stabilisation, 
though. ^From a practical point of view, obtaining a poor evaluation of BFj^^/raii'^^) 
when the Bayes factor is very small (or very large) has limited consequences since the 
poor approximation also leads to the same conclusion about the choice of model mo- 
Note, however, that, when there are more than two models, using these approximations 
to perform Bayesian model averaging can be dangerous. 



3 Results 

3.1 Toy example 

Our first example compares an iid Bernoulli model with a two-state first-order Markov 
chain. Both models are special cases of GRF, the first one with a trivial neighbourhood 
structure and the other one with a nearest neighbourhood structure. Furthermore, the 
normalising constant Zg^„i can be computed in closed form, as well as the posterior 
probabilities of both models. We thus consider a sequence x = (a;i,..,x„) of binary 
variables. Under model Al = 0, the GRF representation of the Bernoulli distribution 
6(exp(0o)/{l + exp(0o)})is 

/o(x|0o) - cxp 1 00 J2 1{-.=i} ) /{I + cxp(^o)}" , 

associated with the sufficient statistic 5'o(x) ~ '^"=i^{xi=i} ^tnd the normalising con- 
stant Zffg^o = (1 -I- e^")". Under a uniform prior 6*0 ^ U{~b, 5), the posterior probability 
of this model is available since the marginal when S'o(x) = sq (sq ^ 0) is given by 



10 ^^V k n-l-k 

k=0 ^ ' 

by a straightforward rational fraction integration. 

Model Al = 1 is chosen as a Markov chain (hence a particular GRF in dimension 
one with i and i' being neighbours if |i — i'| = 1) with the special feature that the 
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probability to remain within the same state is constant over both states, namely 

P(a;,+i = x^\x^) = exp(6li)/{l + exp(6li)} . 

We assume a uniform distribution on a:;i and the likelihood function for this model is 
thus 

/i(x|0i) = iexp f^i f] !{..=.,_,} j /{l+exp(0i)r-i, 

with 5i(x) = J2i=2^{xi=xi-i} being the sufficient statistic and ^e^,! = 2(1 + e^")"^^ 
being the normalising constant in that case. Under a uniform prior 6i ^ U{0,6), 
the posterior probability of this model is once again available, the likelihood being of 
the same form as when Al = 0. The bounds of the prior distributions on 9q and 0i 
were chosen to avoid data sets consisting in a sequence of n identical values since it is 
impossible to distinguish model and model 1 in that case. 

We are therefore in a position to evaluate the ABC approximations of the model 
posterior probabilities and of the Bayes factor against the exact values. For this purpose, 
we simulated 1000 datasets x" = {xi, • • • , a;„) with n — 100 under each model, using 
parameters simulated from the priors and computed the exact posterior probabilities 
and the Bayes factors in both cases. For each of those 2000 datasets x°, the ABC-MC 
algorithm was run for 4x10^ loops, meaning that 4 x 10^ sets {m* , 6'*„. , x* ) were exactly 
simulated from the joint distribution and a random number of those were accepted when 
^(x*) = S'(x''). (In the worst case scenario, the number of acceptances was 12!) As 
shown on the left graph of Figure [!} the fit of the approximate posterior probabilities is 
good for all values of V{M = 0|x°). When we introduce a tolerance e equal to the 1% 
quantile of p(S'(x*'), S'(x*)), p being the Euclidean distance, the results are similar when 
¥{A4 — 0|x°) is close to 0, 1 or 0.5, and we observe a slight difference for other values. 
We also evaluated the approximation of the Bayes factor (and of the subsequent model 
choice) against the exact Bayes factor. As clearly pictured on the left graph of Figure 
[2J the fit is good in the exact case (e = 0), the poorest fits occurring in the limiting 
cases when the Bayes factor is either very large or very small and thus when the model 
choice is not an issue, as noted above. In the central zone when log BF^na/mA'^^) is 
close to 0, the difference is indeed quite small, the few diverging cases being due to 
occurrences of very small acceptance rates. If we classify the values of BF^^/^^{x.^) 

and BFjngfjnii'^^) according to the Jeffrey's scale, we observe that the Bayes factor 
and its approximation belong to the same category (1903 simulated data sets are on the 
diagonal of Table fTl) or to very close categories. Once more, using a tolerance e equal to 
the 1% quantile does not bring much difference in the output. Table [2] shows that the 
Bayes factor and its estimation still belong to the same category for 1805 simulated data 
sets. The approximative Bayes factor is slightly less discriminative in that case, since 
the slope of the cloud is less than the unitary slope of the diagonal on the right graph of 
Figurebl i?i^„j„/TOj(x°) and BF^no/mii^'^) lead to the selection of the same model, but 
with a lower degree of confidence for the second one (Table [2]). The boxplots on Figure 
KB^compare the distributions of the ratios BFrno/mi{^^)/BFjng/mj^{;x°) in the exact case 
and using a tolerance equal to the 1% quantile on the distances. As reported in Table [3J 
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Figure 1: (left) Comparison of the true F{M = 0\x°) with P{M ^ 0|x°) over 2,000 
simulated sequences and 4 x 10^ proposals from the prior. The red line is the diagonal. 
(right) Same comparison when using a tolerance e corresponding to the 1% quantile on 
the distances. 



the median is very close to 1 in both cases. The ratio takes more often extreme values 
in the exact case. Once more, this is a consequence of the poor estimation of the Bayes 
factor when the acceptance rate is small. Given that using the tolerance version allows 
for more simulations to be used in the Bayes factor approximation, we thus recommend 
using this approach. 



3.2 Application to protein 3D structure prediction 

The numerous genome sequences now available provide a huge amount of protein se- 
quences whose functions remain unknown. A classical strategy is to determine the 
tridimensional (3D) structure of a protein, also called fold, as it provides important and 
valuable information about its function. Experimental methods, like those based on 
X-ray diffraction or nuclear magnetic resonance, provide accurate descriptions of 3D- 
structures, but are time consuming. As an alternative, computational methods have 
been proposed to predict 3D structures. 

These latter methods mostly rely on homology (two proteins are said to be homolo- 
gous if they share a common ancestor). In fact, homologous proteins often share similar 
function and, as function is controlled by structure, similar structure. When the protein 
under study, hereafter called the query protein, can be considered as homologous with 
another protein, a prediction of its 3D structure based on the structure of its homolog 
can be built. 

First, one compares the sequence of the query protein with a data bank of sequences 
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Figure 2: (left) Comparison of the true BFj-^^^^^ (x") with -B-Fmo/mi i^'^) (^^ logarith- 
mic scales) over 2,000 simulated sequences and 4 x 10^ proposals from the prior. The 
red line is the diagonal, (right) Same comparison when using a tolerance corresponding 
to the 1% quantile on the distances. 
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Figure 3: (left) Boxplots of the ratios BFmg/mi{^'^)/BF„ig^mi{^'^) (in logarithmic 
scales) in the exact case and using a tolerance equal to the 1% quantile on the distances 
over 2,000 simulated sequences and 4 x 10^ proposals from the prior. 



Grelaud, Marin, Robert, Rodolphe and Taly 



13 





TO = 1 
dec. 


TO = 1 

str. 


TO= 1 

sub. 


TO= 1 

weak 


TO=0 

weak 


TO=0 

sub. 


m = 
str. 


TO=0 

dec. 


7n = \, dec. 


778 


9 




















TO = 1, str. 


2 


79 




















TO = 1, sub. 





7 


53 

















7n = I, weak 








2 


63 





7 








7n = 0, weak 











22 


103 


7 








TO = 0, sub. 














1 


103 


23 





TO = 0, str. 

















5 


177 


6 


TO = 0, dec. 




















13 


547 



Table 1: Comparison of the decisions based on BFj^o/mii^'^) and on BF„ig/m^{x^) 
using e = according to the Jeffrey's scale (dec: decisive log(i3Fmo/mi(x°)) > 2, str.: 
strong 1 < log(BF„„/^^(xO)) < 2, sub.: substantial 0.5 < log(Bi^„„/„Jx")) < 1, 
weak < log(BF^„/„,(xO)) < 0.5). 
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Table 2: Comparison of the decisions based on BFmo/mii^^) and on BFmo/mii^^) ac- 
cording to the Jeffrey's scale, using a tolerance e corresponding to the 1% quantile of the 
distances (dec: decisive log(i3F„„/„j(x°)) > 2, str.: strong 1 < log(i?i^mg/mi(x°)) < 
2, sub.: substantial 0.5 < log(i?F„„/„j^(x°)) < 1, weak < log(i3F„„/„j(x°)) < 0.5). 





90.25 


90.5 


90.75 


e = 


0.914 


1.041 


22.9 


e == 9i% 


0.626 


1.029 


7.9 



Table 3: Quantiles of the ratios BFma/mi{'^^)/BFmo/mx{'y-^) in the exact case and 
using a tolerance e equal to the 1% quantile of the distances. 
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Table 4: Classification of amino acids into hydrophilic and hydrophobic groups. 



Hydrophilic 



KERDQNPHSTG 



Hydrophobic 



AYMWFVLIC 



of proteins of known structures but sequence similarity is often too low to assess homol- 
ogy with sufficient certainty. Because of selection pressure on the function, structures 
are more conserved over time than sequences. Threading methods consist in aligning 
the query sequence onto a set of structures representative of all known folds. The se- 
quence of the query is threaded onto the candidate structures in order to find the most 
compatible one. A score (a fitting criterion) is computed for each proposal. Structures 
displaying sufficiently high scores, if any, are chosen as the corresponding protein can 
be said homologous with the query protein. 

It may happen that both information based on sequence similarity and threading 
score are not sufficient to access protein homology and consequently, to select a 3D 
structure. Our aim is to use extra information to help making a decision, if necessary. 
We use here the fact that amino acids in close contact in the 3D structure often share 
similar (or complementary) biochemical properties. In the example we discuss in this 
section, we use hydrophobicity as a clustering factor since hydrophobic amino-acids are 
mostly buried inside the 3D structure, and hydrophilic ones exposed to water. This 
effect is observed in almost all proteins. 

^From a formal perspective, each structure can be represented by a graph where 
a node represents one amino-acid of the protein and an edge between two nodes indi- 
cates that both amino-acids are in close contact in the folded protein (hence are neigh- 
bours) . Labels are allocated to each node, associated with hydrophobicity of amino-acids 
(amino-acids are classified as hydrophobic or hydrophilic according to Table l4| . Then, 
a Gibbs random field, more precisely an Ising model, can be defined on each graph. 
When several structures are proposed by a threading method, the ABC-MC algorithm 
is then available to select the most likely structure. 

We applied this procedure to proteins of known structures (here called the native 
structures) ItqgA, involved into signal transduction processes in the bacterium Ther- 
motoga maritima and lk77A which is a putative oxygenase from Escherichia coli. In 
these studies, the sequences were treated as queries, since our purpose was to evaluate 
if our idea could help in real situations. 



We used FROST ([Marin et al.[|2002[ ), a software dedicated to threading, and MOD- 
ELLER (Sa h and Blundell[|1993f to find the candidate structures and KAKSI ( [Martin 



[et al.(|2005| ) to build the graphs. All candidate structures were picked up from the Pro- 
tein Data Bank (http://www.rcsb.org/pdb/home/home.do). FROST provides the best 
alignment of the query sequence onto a structure, based on score optimisation, and the 
final score measures alignment quality. A score larger than 9 means that the alignment 
is good, while a score less than 7 means the opposite. For values between 7 and 9, this 
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STl 





ST3 



Figure 4: Superposition of the native structure of ItqgA (grey) with the STl structure 
(red), the ST2 structure (orange), the ST3 structure (green), and the DT structure 
(blue). 



score cannot be used to reach a decision. Additionally, FROST calculates the percent- 
ages of identity between query and candidate sequences; sequences with a percentage 
of sequence identity higher than > 20% can be considered as homologous. 

As the native structures were known, similarities between candidate and native 



structures could be assessed, here by the TM-score, (Zhang and Skolnick 20041. A score 



larger than 0.4 implies both structures are similar and a score less than 0.17 means that 
the prediction is nothing more than a random selection from the PDB library. 

For each query protein, we selected four candidates, called STl, ST2, ST3 and DT, 
covering the whole spectrum of predictions that can be generated by protein threading, 
from good to very poor ( Taly et al. 2008 1 as described in Table [s] and [6J We selected 
essentially candidate structures for which no decision could have been made since they 
were scored in the FROST uncertainty zone. According to the TM-score, STl and ST2 
are considered as similar to the native structure, while ST3 and DT are not. For STl 
and ST2, the alignment of the query sequence onto the candidate structure is good or 
fair; sequence similarity is higher for STl than ST2. ST3 is a poorer candidate since 
it is certainly not an homolog of the query and the alignment is much poorer. For DT, 
the query sequence has been aligned with a structure that only shares few structural 
elements with the native structure. Differences between the native structures and the 
corresponding predicted structures are illustrated on Figure [4]for ItqgA and on Figure 
[5] for lk77A. 

Using ABC-MC, we then estimate the Bayes factors between model NS corre- 
sponding to the true structure and models STl, ST2, ST3, and DT, correspond- 
ing to the predicted structures. Each model is an Ising model with sufficient statistic 
5m(x) — J^-JZ-'^ixi-Xi,}- The scalar parameter Om of the Ising model m is assumed 
to have a uniform prior on the interval [0,4]. Simulated data sets were obtained by a 
standard Gibbs sampler. The Gibbs algorithm has been iterated 1000 times, which is a 
sufficient number of iterations for stabilisation. We picked e as the empirical l%_quantile 
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STt" 





ST3 



Figure 5: Superposition of the native structure of lk77A (grey) with the STl structure 
(red), the ST2 structure (orange), the ST3 structure (green), and the DT structure 
(blue). 





% seq. Id. 


TM-score 


FROST score 


liSnA (STl) 


32 


0.86 


75.3 


llslAl (ST2) 


5 


0.42 


8.9 


IjrSA (ST3) 


4 


0.24 


8.9 


ls7oA (DT) 


10 


0.08 


7.8 



Table 5: Summary of the characteristics of our dataset for the protein ItqgA. % seq. 
Id.: percentage of identity with the query sequence. TM-score: similarity between a 
predicted structure and the native structure. FROST score: quality of the alignment of 
the query onto the candidate structure. 





% seq. Id. 


TM-score 


FROST score 


li60A (STl) 


16 


0.69 


8.9 


IqtwA (ST2) 


6 


0.54 


9.8 


IqpoAl (ST3) 


9 


0.29 


9.3 


lm4oA (DT) 


7 


0.17 


8.3 



Table 6: Summary of the characteristics of our dataset for the protein lk77A. % seq. 
Id.: percentage of identity with the query sequence. TM-score: similarity between a 
predicted structure and the native structure. FROST score: quality of the alignment of 
the query onto the candidate structure. 
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NS/STl 



NS/ST2 



NS/ST3 



NS/DT 



BF 



1.34 



1.22 



2.42 



2.76 



Table 7: Estimates of the Bayes factors between model NS and models STl, ST2, 
ST3, and DT, based on an ABC-MC algorithm using 1,2 x 10^ simulations and a 
tolerance e equal to the 1% quantile of the distances for the query protein ItqgA. 



NS/STl 



NS/ST2 



NS/ST3 



NS/DT 



BF 



1.07 



1.14 



11997 



14.24 



Table 8: Estimates of the Bayes factors between model NS and models STl, ST2, 
ST3, and DT, NS based on an ABC-MC algorithm using 1, 2 x 10^ simulations and a 
tolerance e equal to the 1% quantile of the distances for the query protein lk77A. 



of the Euclidean distance p{S{x°), S{x*)). 

Estimated values for the Bayes factors of model NS against each alternative are given 
in Tables [T] and [8] As expected, all estimated Bayes factors are larger than 1 indicating 
that the data is always in favour of the native structure, when compared with one of 
the four alternatives and Bayes factors increase when the similarity between candidate 
and native structure is lower. Moreover, we can classify the candidate structures into 
two categories: for STl and ST2, the evidence is weak in favour of the native structure 
while the evidence is substantial or strong when the alternative is ST3 or DT. Thus 
our approach can distinguish similar from dissimilar structures, even when they were 
scored in the FROST uncertainty zone. 



4 Discussion 

This paper has hopefully demonstrated that the auxiliary variable technique that sup- 
ports the ABC algorithm can be used to overcome the lack of closed-form normalising 
constants in Gibbs random field models and in particular in Ising models. The computa- 
tion of Bayes factors can therefore follow from a standard Monte Carlo simulation that 
includes the model index without requiring advanced techniques like reversible jump 



moves (Robert and Casella 20041. The usual approximation inherent to ABC methods 



can furthermore be avoided due to the availability of a sufficient statistic across models. 
However, the toy example studied above shows that the accuracy of the approximation 
to the posterior probabilities and to the Bayes factor can be greatly improved by re- 
sorting to the original ABC approach, since it allows for the inclusion of many more 
simulations. In the biophysical application to the choice of a folding structure for two 
proteins, we have also demonstrated that we can implement the ABC solution on real- 
istic datasets and, in the examples processed there, that the Bayes factors allow for a 
ranking more standard methods do not. 
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